Dual Attention and Patient Similarity Network for drug recommendation

Abstract

Motivation: Making clinical decisions automatically for patients with multi-morbidity has long been considered a thorny problem due to the complexity of the diseases involved. Drug recommendation can assist doctors by automatically providing effective and safe drug combinations that are conducive to treatment and reduce adverse reactions. However, existing drug recommendation works ignore two critical pieces of information. (i) Different types of medical information and their interrelationships in the patient's visit history can be used to construct a comprehensive patient representation. (ii) Patients with similar disease characteristics and their corresponding medication information can be used as a reference for predicting drug combinations.

Results: To address these limitations, we propose DAPSNet, which encodes multi-type medical codes into patient representations through code- and visit-level attention mechanisms, while integrating drug information corresponding to similar patient states to improve the performance of drug recommendation. Specifically, DAPSNet is inspired by the decision-making process of human doctors. Given a patient, DAPSNet first learns the importance of the patient's history records across diagnoses, procedures and drugs in different visits, then retrieves the drug information corresponding to similar patient disease states to assist drug combination prediction. Moreover, in the training stage, we introduce a novel information constraint loss function based on the information bottleneck principle to constrain the learned representation and enhance the robustness of DAPSNet. We evaluate the proposed DAPSNet on the public MIMIC-III dataset, where our model achieves relative improvements of 1.33%, 1.20% and 2.03% in Jaccard, F1 and PR-AUC scores, respectively, compared to state-of-the-art methods.

Availability and implementation: The source code is available at the GitHub repository: https://github.com/andylun96/DAPSNet.


Introduction
In recent years, the widespread usage of patient Electronic Health Records (EHRs) has promoted the development of intelligent healthcare, which provides clinical decision support for doctors and improves the quality and efficiency of disease diagnosis and treatment recommendations (Li et al., 2020; Ma et al., 2017, 2020a; Nguyen et al., 2021; Zhang et al., 2020). Drug recommendation, as an important task, can provide effective and safe prescriptions for doctors' reference (Cheng et al., 2019; Choi et al., 2016a; Li et al., 2017). Existing work learns patient representations (PRs) by capturing temporal patterns of patient medical information in EHRs to accurately predict drug combinations. This process can be carried out in two ways. (i) Instance-based drug recommendation (Gong et al., 2021; Zhang et al., 2017) only uses the patient's current diagnosis and procedure records for drug recommendation while ignoring the longitudinal patient history information. Therefore, instance-based methods cannot capture the patient's historical disease process. (ii) To overcome this issue, longitudinal-based drug recommendation (Choi et al., 2016b; Shang et al., 2019; Wang et al., 2021; Wu et al., 2022b; Yang et al., 2021a, b) models the longitudinal patient history in the temporal dimension and combines the learned PRs with drug representations to predict drug combinations.
However, considering the complexity of patients' clinical treatment, the existing methods still suffer from two major challenges. (i) Inadequate PR. Most existing works (Choi et al., 2016b; Shang et al., 2019; Yang et al., 2021b) only regard the diagnosis and procedure information in the patient's historical visits as two independent views: they encode the medical information separately and concatenate the results to obtain the PR (Hochreiter and Schmidhuber, 1997). On the one hand, the obtained PR is not comprehensive, as it takes into account neither the impact of historical prescriptions on the patient's disease state nor the relationship between medical perspectives on different dimensions of the disease trajectories. On the other hand, most deep learning methods are equipped with multi-layer neural networks to model the long-term visit dependence of patients, but more layers lead to an increase in the mutual information between the obtained PR and the input patient information, as well as the loss of critical information reflecting the patient's disease state. (ii) Inaccurate patient similarity. Existing works (Suo et al., 2017; Zhang et al., 2014) exploit similarities between patients' global representations to provide recommendations. However, patients with similar disease courses are the basis of personalized prediction, considering that in clinical practice the course of a patient's disease is often long and complex (Allam et al., 2020). Therefore, compared with calculating the similarity of patient states, the similarity of global patient representations may lose or underutilize the correlation between patient disease processes that is needed to accurately match drugs and disease states.
To overcome the above challenges, we propose a deep learning model, the Dual Attention and Patient Similarity Network (DAPSNet), which fully incorporates the longitudinal information into PRs while integrating patient similarity to improve the performance of drug recommendation. Specifically, DAPSNet consists of: (i) a PR module that utilizes code- and visit-level attention mechanisms to encode comprehensive PRs by integrating diagnosis, procedure and drug information from patients' historical visits and current visit; (ii) a patient retrieval module, which constructs a patient representation memory (PM) to store the disease state representations and corresponding drug combinations of different patients, and further retrieves corresponding drug information based on the similarity of the patient's current state with their own historical states and other patients' historical states; (iii) a drug recommendation module that learns patient-drug matching by concatenating the PRs and the drug information retrieved via disease state similarity, then predicts the drug combinations via a sigmoid layer.
In the training stage, the DAPSNet is optimized by multiple loss functions. To reduce the information loss when learning PRs and enhance the robustness of the model, we introduce an information constrained (IC) loss function based on the information bottleneck (IB) principle (Tishby and Zaslavsky, 2015). The IC loss function aims to maximize the mutual information between PRs and labels, while minimizing the mutual information between PRs and the input patient information. Moreover, we adopt the multi-label prediction (MP) loss and drug-drug interaction (DDI) loss for guiding the model in making accurate and safe predictions. Experiments on the public MIMIC-III dataset demonstrate the effectiveness and safety of our model.
Our main contributions are summarized as follows:
• We propose DAPSNet, a novel drug recommendation model that makes accurate and safe drug recommendations by leveraging the various medical information in a patient's historical visits as well as the similarity of patients' disease states.
• We design a novel PR module that obtains a comprehensive PR by combining the patient's visit information encoded by code- and visit-level attention mechanisms.
• We introduce an information constraint loss to constrain the learned representation and further enhance the robustness of our model.
• We design a novel patient retrieval module that contains a PM including all patients' state representations and corresponding drug combinations. Furthermore, we retrieve corresponding drug information based on the patient state similarity to improve the performance of drug recommendation.
• We conduct extensive experiments on the MIMIC-III database against several state-of-the-art methods. Our model outperforms the best baselines by 1.33%, 1.20%, 2.03% and 0.59% in Jaccard similarity, F1-score, Precision Recall Area Under Curve (PR-AUC) and Receiver Operating Characteristic Area Under Curve (ROC-AUC), respectively. The experimental results demonstrate that our proposed DAPSNet is effective, safe and robust.

Related work
According to the PR learning strategies, existing drug recommendation methods can be divided into rule-based, instance-based and longitudinal-based drug recommendation.
1. Rule-based drug recommendation. Rule-based drug recommendation (Almirall et al., 2012; Chen et al., 2016) relies on the medical guidelines summarized by clinicians, which requires a lot of medical resources and effort. For example, Chen et al. (2016) design a knowledge patterns system to recommend treatment from the patient's medical information. However, these methods are highly limited and lack generalization.
2. Instance-based drug recommendation. Instance-based drug recommendation (Gong et al., 2021; Zhang et al., 2017) only learns the PR from the current visit. For example, Zhang et al. (2017) encode the patient's current visit with the attention mechanism and propose a multi-instance multi-label learning framework. Gong et al. (2021) leverage multiple data sources to learn the embeddings of the patient-disease-medicine relations through a knowledge graph for drug recommendation. However, these methods ignore the longitudinal patient historical information.
3. Longitudinal-based drug recommendation. Longitudinal-based drug recommendation (Choi et al., 2016b; Shang et al., 2019; Wu et al., 2022a, b; Yang et al., 2021a, b) models the longitudinal patient history in the temporal dimension. Wu et al. (2022b) transform the drug recommendation task into the problem of drug change prediction. MICRON (Yang et al., 2021a) is proposed to model the health condition changes by a recurrent residual learning approach, and COGNet (Wu et al., 2022b) uses the copy-or-predict mechanism to model the relationship between drug changes in patients' continuous visits.
However, few works have focused on constructing patient disease state representations that can reflect the temporal complexity of disease processes. Furthermore, due to the limitations of the model, it is difficult for existing methods to capture the patients' long range visit dependency. This article utilizes dual view attention mechanisms to encode comprehensive PRs while integrating patient similarity to improve the performance of drug recommendation.

Materials and methods
In this section, we first define the notation and formulate the drug recommendation problem. Thereafter, we introduce the proposed DAPSNet.

Preliminaries and problem formulation
Preliminaries
The longitudinal EHRs contain a variety of sequential medical codes of patients, e.g. diagnoses, procedures and drugs. Each patient can be represented as a series of clinical treatment events. Taking patient $i$ as an example, $X_i = [x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(T_i)}]$, where $i \in \{1, 2, \ldots, N\}$, $N$ is the number of patients in the dataset, and $T_i$ denotes the total number of visits for patient $i$. We use a tuple $[d_i^{(t)}, p_i^{(t)}, m_i^{(t)}]$ to represent the clinical visit $x_i^{(t)}$ of patient $i$, where $d_i^{(t)} \in \{0, 1\}^{|\mathcal{D}|}$, $p_i^{(t)} \in \{0, 1\}^{|\mathcal{P}|}$ and $m_i^{(t)} \in \{0, 1\}^{|\mathcal{M}|}$ are multi-hot diagnosis, procedure and drug vectors, respectively, while $\mathcal{D}$, $\mathcal{P}$ and $\mathcal{M}$ are the diagnosis, procedure and drug sets, respectively. Meanwhile, the disease state from $x_i^{(1)}$ to $x_i^{(t)}$ of patient $i$ is denoted as $X_i^{1:t}$.
EHR and DDI graphs. We use $G_E = \{\mathcal{M}, E_E\}$ and $G_D = \{\mathcal{M}, E_D\}$ to denote the EHR graph and the DDI graph, respectively, where $\mathcal{M}$ is the drug set, and $E_E$ and $E_D$ are the edge sets of the prescriptions in EHRs and of the known DDIs from external knowledge, respectively. We use the adjacency matrices $A_E, A_D \in \mathbb{R}^{|\mathcal{M}| \times |\mathcal{M}|}$ to store the edge information in $E_E$ and $E_D$. $A_E[i, j] = 1$ means that the $i$th drug and the $j$th drug appear in the prescription of the same visit, and $A_D[i, j] = 1$ represents an adverse reaction between the $i$th drug and the $j$th drug.
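The data structures above can be illustrated with a small sketch. It assumes that medical codes have already been mapped to integer indices; the helper names `multi_hot` and `ehr_adjacency` and the toy patient are hypothetical and only serve to show how the multi-hot visit vectors and the EHR co-prescription matrix $A_E$ could be built.

```python
from itertools import combinations


def multi_hot(codes, vocab_size):
    """Return a 0/1 list of length vocab_size with ones at the given code indices."""
    vec = [0] * vocab_size
    for c in codes:
        vec[c] = 1
    return vec


def ehr_adjacency(visit_drug_sets, num_drugs):
    """Build A_E: A_E[i][j] = 1 if drugs i and j were prescribed in the same visit."""
    adj = [[0] * num_drugs for _ in range(num_drugs)]
    for drugs in visit_drug_sets:
        for i, j in combinations(sorted(set(drugs)), 2):
            adj[i][j] = 1
            adj[j][i] = 1
    return adj


# Toy patient with two visits; each visit is (diagnoses, procedures, drugs).
patient = [
    ([0, 2], [1], [0, 3]),
    ([2], [0, 1], [1, 3]),
]
NUM_DIAG, NUM_PROC, NUM_DRUG = 4, 2, 4  # hypothetical vocabulary sizes

# Each visit x^(t) becomes the tuple [d^(t), p^(t), m^(t)] of multi-hot vectors.
visits = [
    (multi_hot(d, NUM_DIAG), multi_hot(p, NUM_PROC), multi_hot(m, NUM_DRUG))
    for d, p, m in patient
]
A_E = ehr_adjacency([m for _, _, m in patient], NUM_DRUG)
```

The DDI matrix $A_D$ would be filled in the same symmetric way, but from an external list of interacting drug pairs rather than from prescriptions.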

Problem formulation
In our drug recommendation task, we aim at predicting the drug combination set $\hat{m}^{(t)} \in \{0, 1\}^{|\mathcal{M}|}$ for each patient at visit $t$. At visit $t$ of patient $i$, we are given the patient disease state $X_i^{1:t-1}$, the current diagnosis and procedure codes $[d_i^{(t)}, p_i^{(t)}]$, the EHR graph $G_E$ and the DDI graph $G_D$. Our model aims to minimize the gap between the current prediction $\hat{m}_i^{(t)}$ and the real prescription $m_i^{(t)} \in \{0, 1\}^{|\mathcal{M}|}$. The main notations used in this article are listed in Table 1.

DAPSNet
As illustrated in Figure 1, DAPSNet consists of the following three components: (i) the PR Module, which learns the patient disease state representation from the medical codes in the longitudinal history data; (ii) the Patient Retrieval Module, which utilizes patient similarity to generate additional PRs; and (iii) the Drug Recommendation Module, which predicts the drug combinations based on the concatenated patient representations. Each component of DAPSNet is detailed below in turn.

(i) PR Module
In the longitudinal EHRs, each medical code plays an important role in the PR. The diagnosis, procedure and drug information reflect the patient's health status, treatment process and historical prescriptions, respectively. To make full use of this medical information, we design a patient encoder that includes embedding tables for the different medical codes. We first embed the disease state from $x_i^{(1)}$ onwards, where $j \in \{1, 2, \ldots, t\}$ indicates the index of the visit when the medical codes are diagnosis and procedure information, and $j \in \{1, 2, \ldots, t-1\}$ when the medical code is drug information. $E_d \in \mathbb{R}^{|\mathcal{D}| \times dim}$, $E_p \in \mathbb{R}^{|\mathcal{P}| \times dim}$ and $E_m^* \in \mathbb{R}^{|\mathcal{M}| \times dim}$ are the embedding tables of diagnosis, procedure and drug, respectively ($E_m^*$ will be explained in the next section), and d

Drug Graph encoder
There are two kinds of graph structure information in the EHR data and external knowledge. The EHR graph records that some drugs are prescribed at the same time to improve the curative effect, and the DDI graph records that some drugs have adverse reactions and cannot be used at the same time. Inspired by GAMENet (Shang et al., 2019), which uses the Graph Convolutional Network (GCN) (Kipf and Welling, 2016) to encode the drug representation, we encode the EHR graph $G_E$ and the DDI graph $G_D$ to obtain the drug representation, so as to recommend effective and safe drug combinations.
Given the input drug embedding table $E_m \in \mathbb{R}^{|\mathcal{M}| \times dim}$ and the drug adjacency matrix $A_* \in \mathbb{R}^{|\mathcal{M}| \times |\mathcal{M}|}$, we use a GCN layer to obtain the drug representations, where $\tilde{D}$ is a diagonal matrix of $\hat{A}_*$ (e.g.
Then, we use a two-layer GCN to model the improved embeddings on each graph. We model the co-occurrence relations and the DDIs based on the EHR and DDI adjacency matrices $A_E$ and $A_D$ separately, as in Equation (3), where $W_e$ and $W_d$ are the hidden learnable parameter matrices and $G_e$ and $G_d$ are the generated drug relation representations. Finally, we fuse the two generated relation representations $G_e$ and $G_d$ to obtain the final drug representation $E_m^*$, where $\delta$ is the learnable parameter used to fuse the different relation graphs.
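As a rough illustration of the drug graph encoder, the following sketch applies a two-layer GCN to each graph and fuses the results. It is not the authors' implementation: it assumes the standard symmetric normalization $\tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ of Kipf and Welling, a ReLU between the two layers, and a fusion of the form $G_e - \delta G_d$. All function names and the toy matrices are hypothetical, and pure Python lists are used in place of tensors.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops (assumed form)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def two_layer_gcn(adj, emb, w1, w2):
    """Two propagation steps: A_norm * ReLU(A_norm * E * W1) * W2."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(matmul(a_norm, emb), w1))
    return matmul(matmul(a_norm, hidden), w2)


def fuse(g_e, g_d, delta):
    """Assumed fusion rule: EHR representation minus delta-weighted DDI representation."""
    return [[e - delta * d for e, d in zip(row_e, row_d)] for row_e, row_d in zip(g_e, g_d)]


# Toy setting: three drugs, embedding dimension two.
A_E = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # drugs 0 and 1 are co-prescribed
A_D = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # drugs 0 and 2 interact
E_m = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]             # identity weights keep the example readable

G_e = two_layer_gcn(A_E, E_m, W, W)
G_d = two_layer_gcn(A_D, E_m, W, W)
E_star = fuse(G_e, G_d, delta=0.5)
```

In a trained model the weight matrices and $\delta$ would be learned; here they are fixed so that the effect of the normalization can be inspected by hand.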

Attention mechanisms
Inspired by AMANet (He et al., 2020), which learns the intra-view interaction and inter-view interaction for dual asynchronous sequential learning through self- and inter-attention mechanisms, respectively, we treat the diagnoses, procedures and drugs in the disease process recorded in the EHR data as three sequential views. Based on the above embedding process, in order to select the relatively important visits in the disease process and the critical medical codes in each visit, we design two different attention mechanisms, namely the code-level attention and the visit-level attention, which give different weights to different medical codes and to different visit records in the patient disease process.
Firstly, in order to select the critical medical codes in each visit (e.g. $e_i^{(j)}$), we design the code-level attention to obtain the weights $\alpha$ corresponding to each medical code, computed with the learnable parameters $W_g$ and $b_g$. Moreover, in order to select the relatively important visits in the patient disease process $e_i^{1:t-1}$, we design the visit-level attention to obtain the weights $\beta$ corresponding to each visit, computed with the learnable parameter $W_h$. Therefore, we can select the critical medical code elements in each visit and the relatively important visits in the patient disease process to jointly obtain the final PR $q_i^{(t)}$ by combining the above two attention mechanisms with the patient's disease process and the current medical codes. In this calculation, $q_i^{(t)}$ is the representation of patient $i$ at the $t$-th visit, the embedded vector $e_i^{t} = [d_{e_i}^{(t)}, p_{e_i}^{(t)}]$ is the current medical code and $\odot$ is the element-wise multiplication.
The PR module of our model is composed of embedding layers for the different medical codes and two different attention mechanisms. Compared with previous work, DAPSNet can learn a more comprehensive patient disease state representation from the longitudinal patient EHR data, which helps the subsequent patient retrieval module to measure similarity accurately.
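The two attention levels described in this section can be pictured with a generic sketch. It is only an illustration of the idea of weighting codes within a visit and then weighting visits within a history: the scoring functions (a tanh-activated linear score for codes, a linear score for visits, both normalized with softmax) are assumptions, not the paper's equations, and every name below is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Return sum_k weights[k] * vectors[k] as a single vector."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def code_level_attention(code_embs, w_g, b_g):
    """Weights alpha over the code embeddings of one visit, plus the pooled visit vector."""
    alpha = softmax([math.tanh(dot(w_g, e) + b_g) for e in code_embs])
    return alpha, weighted_sum(alpha, code_embs)


def visit_level_attention(visit_vecs, w_h):
    """Weights beta over the visit vectors of one patient, plus the pooled history vector."""
    beta = softmax([dot(w_h, v) for v in visit_vecs])
    return beta, weighted_sum(beta, visit_vecs)


# Toy history: visit 1 has two code embeddings, visit 2 has one.
history = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0]],
]
w_g, b_g, w_h = [0.5, -0.5], 0.0, [1.0, 1.0]  # hypothetical parameters

alphas, visit_vecs = [], []
for codes in history:
    alpha, vec = code_level_attention(codes, w_g, b_g)
    alphas.append(alpha)
    visit_vecs.append(vec)
beta, history_vec = visit_level_attention(visit_vecs, w_h)
```

How the pooled history is combined with the current visit to form $q_i^{(t)}$ is left out of this sketch.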

(ii) Patient Retrieval Module
We record each visit of the $N$ patients in the EHRs as $\{X_1^{1:T_1}, \ldots, X_N^{1:T_N}\}$. Our patient retrieval module first constructs a PM to store the disease state representations and the corresponding drug combinations of different patients, and further calculates the similarity information between the PRs from the PM to retrieve the corresponding drug information. The retrieval process can be separated into the following three steps.
First, we build a PM to store the patient disease state representations $\{q_j^{(k)}\}_{j=1,k=1}^{j=N,k=T_j}$ learned from the PR module and the corresponding drug combinations $\{m_j^{(k)}\}_{j=1,k=1}^{j=N,k=T_j}$ in each visit.
a. Current Similarity: Next, we calculate the similarity between the patient's current representation $q_i^{(t)}$ and the drug representation $E_m^*$, which we record as the Current Similarity $sim_C$. Using the current similarity, we can directly retrieve the Current Similarity drug information $C_i^{(t)}$, where the Current Similarity $sim_C(\cdot, \cdot)$ calculates the similarity matrix between the PR $q_i^{(t)}$ and the drug representation $E_m^*$; we then use the Softmax function to normalize the weight matrix.
b. Historical Similarity: Due to the temporal complexity of patient disease processes, taking patient $i$ as an example, the disease state representation at visit $t$ of the target patient $i$: $q$

(iii) Drug Recommendation Module
Here, $o_i^{(t)} \in \mathbb{R}^{|\mathcal{M}|}$ denotes the final matching scores for patient $i$. By comparing the matching scores $o_i^{(t)}$ to a pre-defined threshold parameter $w$, we can obtain the final drug combinations $\hat{m}_i^{(t)} \in \mathbb{R}^{|\mathcal{M}|}$ predicted by our model.
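The current-similarity retrieval and the final thresholding step can be sketched as follows. The sketch assumes a dot-product similarity, softmax-weighted pooling of the drug embeddings, and a greater-than-or-equal comparison against the threshold; none of these details are confirmed by the text, and the function names are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def current_similarity_retrieval(q, drug_emb):
    """Softmax-normalized similarity of q to every drug, then a weighted drug summary."""
    weights = softmax([dot(q, e) for e in drug_emb])
    pooled = [sum(w * e[k] for w, e in zip(weights, drug_emb)) for k in range(len(q))]
    return weights, pooled


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_drugs(logits, threshold=0.5):
    """Turn raw matching logits into scores o and a 0/1 drug combination."""
    scores = [sigmoid(z) for z in logits]
    return scores, [1 if s >= threshold else 0 for s in scores]


# Toy example with three drugs and two-dimensional representations.
q = [1.0, 0.0]
drug_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, drug_info = current_similarity_retrieval(q, drug_emb)

scores, combo = predict_drugs([2.0, -1.0, 0.0], threshold=0.5)
```

The historical similarity would follow the same pattern, with the representations stored in the PM taking the place of the drug embeddings as retrieval keys.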

(iv) Loss Function
DAPSNet is trained with three loss functions: (i) a DDI Loss for explicitly constraining the DDI rate in the predicted drug combinations, (ii) an MP Loss for accurately predicting the drug combinations and (iii) an Information Constraint Loss, based on the IB principle, for enhancing the robustness of the model. We optimize all learnable parameters simultaneously during the training process.
i. DDI Loss: For the drug combinations, we want to achieve a lower DDI rate, which reduces adverse reactions and yields safe drug recommendations. Based on the DDI adjacency matrix $A_D$, we design the DDI loss for a single visit from the matching scores $o_i^{(t)}$. During training, the model performs back-propagation according to the average DDI loss over all visits.
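A common way to penalize interacting drug pairs is a pairwise-product penalty, $\sum_{a}\sum_{b} A_D[a, b]\, o_a\, o_b$, which grows when two interacting drugs both receive high scores. Whether DAPSNet uses exactly this form is an assumption; the sketch below (with the hypothetical helper `ddi_loss`) only illustrates the idea and the averaging over visits.

```python
def ddi_loss(scores, ddi_adj):
    """Sum of o_a * o_b over all ordered drug pairs (a, b) flagged in the DDI matrix."""
    n = len(scores)
    return sum(
        ddi_adj[a][b] * scores[a] * scores[b]
        for a in range(n)
        for b in range(n)
    )


# Toy DDI matrix: drugs 0 and 1 interact (stored symmetrically).
A_D = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

# Matching scores for two visits; the first scores both interacting drugs highly.
visit_scores = [
    [0.9, 0.8, 0.1],
    [0.9, 0.1, 0.8],
]
per_visit = [ddi_loss(o, A_D) for o in visit_scores]
average_ddi_loss = sum(per_visit) / len(per_visit)
```

The first visit is penalized more heavily than the second because both interacting drugs have high scores there.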
ii. MP Loss: We consider drug recommendation as a multi-label binary classification task and use two common multi-label loss functions. The first is the Multi-Label Margin (MLM) loss (Ji and Ye, 2009), which is popular in existing drug recommendation works such as GAMENet (Shang et al., 2019), SafeDrug (Yang et al., 2021b) and COGNet (Wu et al., 2022b). The MLM loss ensures that the predicted probability of each ground-truth label exceeds that of the other labels by a margin of at least 1. The second is the Binary Cross-Entropy (BCE) loss. The MP loss is formulated by combining the MLM loss and the BCE loss with a balance hyper-parameter $\mu$.
iii. Information Constraint Loss: To obtain a compact and comprehensive PR, we extend the IB principle to the drug recommendation task. Specifically, we minimize the mutual information between the latent PRs $q$ and the input medical codes $X$ while maximizing the mutual information between the latent representations $q$ and the drug labels $m$: $\min I(X; q) - \phi I(q; m)$. According to the variational approximation of the IB (Alemi et al., 2016), we can obtain a lower bound of $I(q; m)$; thus, the latter term of the objective function in Equation (21) is equal to the BCE loss mentioned above.
According to the definition of mutual information, $I(X; q) = H(q) - H(q|X)$. During training, the model produces a deterministic latent PR $q$ for the given medical code input $X$, so the conditional entropy of $q$ given $X$ is $H(q|X) = 0$. Therefore, minimizing the mutual information $I(X; q)$ reduces to minimizing the entropy of $q$, $H(q)$. The latent PR $q$ consists of multiple visits $q_t$. Given a real-valued positive definite kernel $\kappa$, the Gram matrix $K$ is defined by $K_{i,j} = \kappa(q_i, q_j)$, where $q_i$ and $q_j$ are the representations of the $i$-th and $j$-th samples in a batch, respectively. We use a matrix-based analogue of Rényi's $\alpha$-entropy to approximate $H(q)$ (Yu et al., 2021), where $\alpha \in (0, 1) \cup (1, +\infty)$, $A$ is the normalized version of $K$, $A = K / \mathrm{tr}(K)$, and $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$. To simplify the formulation of IB, we define the last term of Equation (23) as the information constraint loss $L_{IC}$.
iv. Overall Loss Function: During training, the overall loss function $L$ is obtained as the weighted sum of the three loss functions to optimize the drug recommendation network, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights for the different loss functions.

(v) Algorithm
Our training algorithm is detailed in Algorithm 1.
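The loss terms in this section can be made concrete with a small sketch under several stated assumptions. The margin loss follows the usual multi-label hinge form; the MP loss is assumed to be $(1-\mu)\,L_{BCE} + \mu\,L_{MLM}$; the entropy uses the order $\alpha = 2$ with a Gaussian kernel, for which $\sum_i \lambda_i(A)^2 = \mathrm{tr}(A^2)$ and no eigendecomposition is needed; and the assignment of the three weights to the DDI, MP and IC terms is a guess. All function names are hypothetical.

```python
import math


def bce_loss(scores, target, eps=1e-12):
    """Mean binary cross-entropy between scores in (0, 1) and 0/1 labels."""
    return -sum(
        t * math.log(s + eps) + (1 - t) * math.log(1 - s + eps)
        for s, t in zip(scores, target)
    ) / len(scores)


def multilabel_margin_loss(scores, target, margin=1.0):
    """Hinge penalty whenever a true label fails to beat a false label by `margin`."""
    pos = [i for i, t in enumerate(target) if t == 1]
    neg = [i for i, t in enumerate(target) if t == 0]
    total = sum(max(0.0, margin - (scores[p] - scores[n])) for p in pos for n in neg)
    return total / len(scores)


def renyi2_entropy(batch, sigma=1.0):
    """Matrix-based Renyi entropy of order alpha = 2 with a Gaussian kernel."""
    k = [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2)) for y in batch]
        for x in batch
    ]
    trace = sum(k[i][i] for i in range(len(k)))
    a = [[v / trace for v in row] for row in k]
    # For symmetric A, the sum of squared eigenvalues equals the sum of squared entries.
    return -math.log2(sum(v * v for row in a for v in row))


def overall_loss(l_ddi, l_mp, l_ic, lambdas=(0.2, 0.75, 0.05)):
    """Weighted sum of the three terms; the weight-to-term mapping is an assumption."""
    l1, l2, l3 = lambdas
    return l1 * l_ddi + l2 * l_mp + l3 * l_ic


scores = [0.9, 0.2, 0.6]
target = [1, 0, 1]
mu = 0.05
l_mp = (1 - mu) * bce_loss(scores, target) + mu * multilabel_margin_loss(scores, target)

# Four well-separated representations: the kernel matrix is close to the identity.
reps = [[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]]
l_ic = renyi2_entropy(reps)
total = overall_loss(l_ddi=0.0, l_mp=l_mp, l_ic=l_ic)
```

The entropy is zero when all representations in a batch coincide and reaches $\log_2 n$ when the $n$ representations are mutually far apart, which is why minimizing it encourages compact representations.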

Dataset
We evaluate the effectiveness and safety of the proposed DAPSNet and the baselines on the public MIMIC-III database (Johnson et al., 2016). Our dataset is processed according to the protocol proposed by PhysioNet (Goldberger et al., 2000). Following the data preprocessing in previous work (Shang et al., 2019; Yang et al., 2021b), we choose the drugs prescribed within the first 24 h and only keep the patients with more than one visit in our dataset. The diagnosis and procedure data are coded according to the ninth revision of the International Classification of Diseases (http://www.icd9data.com/). We extract DDI information of the top-40 most common types from TWOSIDES (Tatonetti et al., 2012), where the drugs are represented by ATC third-level codes (https://www.whocc.no/atc/structure_and_principles/). In order to integrate the DDI data and compute the DDI score, we transform the NDC codes to the same ATC third-level codes. We stratify the patients in the dataset by their number of visits. The statistics of the processed MIMIC-III dataset are summarized in Table 2.

Implementation details
Following previous drug recommendation work (Shang et al., 2019; Wu et al., 2022b; Yang et al., 2021b), we divide the dataset into training, validation and test sets in the ratio 2/3 : 1/6 : 1/6. Our method is implemented in PyTorch (https://pytorch.org) 1.6.0 with Python 3.7.5 and tested on a machine with Intel Xeon CPUs and two NVIDIA 2080Ti GPUs. We choose the optimal hyper-parameters of our model based on the validation set: the dimension size is set to 64 and the threshold $\delta$ is set to 0.5. The weights $\mu$, $\lambda_1$, $\lambda_2$ and $\lambda_3$ in our overall loss are set to 0.05, 0.2, 0.75 and 0.05, respectively. We use a learning rate of $2 \times 10^{-4}$ to train our model within 50 epochs. Our model is optimized by the Adam optimizer (Kingma and Ba, 2014). All the baselines are trained and implemented with the optimized parameters from the references.

Algorithm 1 (closing steps): 14: end for; 15: Generate and accumulate $L_{DDI}$, $L_{mp}$, $L_{IC}$ in Equations (18) and (23), respectively; 16: end for; 17: Optimize the combined loss $L$ in Equation (24).

Table 3 reports the experimental results of the proposed DAPSNet and the baselines. We conduct 10 rounds of tests for all the models and report the mean and standard deviation of their metric scores. Overall, our proposed model DAPSNet outperforms all baselines, with improvements of 1.33%, 1.20%, 2.03% and 0.59% in Jaccard similarity, F1-score, PR-AUC and ROC-AUC, respectively. Compared with the deep learning models that utilize longitudinal patient information (e.g. RETAIN, DMNC, GAMENet, SafeDrug, MICRON and COGNet), the instance-based models (e.g. LR, ECC and LEAP) that only consider the current visit show poor prediction accuracy. At the same time, the DDI rate of the predicted drug combinations is similar to that of the MIMIC-III dataset itself (average DDI: 0.08379).

Performance comparison
In detail, RETAIN and DMNC only encode the patient's historical information and do not introduce external knowledge into the models. In contrast, GAMENet improves performance by encoding drug embeddings from the external graph structure and constructing a patient memory bank, but it yields a high DDI rate. SafeDrug models the molecular graph in drug encoding and introduces a DDI-controllable loss function, resulting in a further performance improvement and the lowest DDI rate among SOTA methods. Following the parameter chosen in SafeDrug, we set the DDI threshold in SafeDrug to 0.06. Different from the above models, MICRON and COGNet exploit the correlation between the drug combinations in two consecutive visits. MICRON uses the recurrent residual method to predict the unchanged drugs. Considering the correlation between drugs, COGNet introduces a copy-or-predict mechanism to determine whether historical prescriptions are still relevant, which further improves the performance but maintains a high DDI rate because it has no DDI constraint. Compared with the above SOTA models, DAPSNet achieves higher prediction accuracy under DDI constraints. The experimental results show that our model can balance the accuracy and safety of prediction through the PR and patient retrieval modules and predict safer and more effective drug combinations than the existing methods.
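For readers who want to reproduce the set-based metrics compared above, the following sketch computes Jaccard similarity, F1-score and DDI rate for one visit. The DDI rate is assumed here to be the fraction of unordered drug pairs in the predicted set that are flagged in the DDI matrix, which matches common usage in this line of work but is not spelled out in the text; the function names are hypothetical.

```python
from itertools import combinations


def jaccard(pred, truth):
    """Size of the intersection over size of the union of two drug sets."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0


def f1_score(pred, truth):
    """Harmonic mean of precision and recall for one visit."""
    hits = len(pred & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


def ddi_rate(pred, ddi_adj):
    """Share of unordered pairs in the predicted set that are known interactions."""
    pairs = list(combinations(sorted(pred), 2))
    if not pairs:
        return 0.0
    return sum(ddi_adj[a][b] for a, b in pairs) / len(pairs)


# Toy visit: drug indices predicted by a model versus the real prescription.
predicted = {0, 1, 2}
prescribed = {1, 2, 3}
A_D = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

visit_jaccard = jaccard(predicted, prescribed)
visit_f1 = f1_score(predicted, prescribed)
visit_ddi = ddi_rate(predicted, A_D)
```

Dataset-level scores would then be obtained by averaging these per-visit values; PR-AUC and ROC-AUC additionally need the continuous matching scores rather than the thresholded sets.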

Historical information utilization
To further explore the ability of DAPSNet to capture the medical information in historical visits, we compare the performance of different models as the number of visits in the dataset varies. According to the dataset statistics in Table 2, the average number of visits per patient in our dataset is 2.37, and the proportion of patients with more than five visits does not exceed 10%. We therefore stratify the dataset by the number of visits to study its impact on the performance of different models. For comparison, we choose the recent SafeDrug, MICRON and COGNet as strong baselines. The comparison results of the various methods for different numbers of visits are shown in Figure 2. The horizontal axis represents the number of patient visits and the vertical axis represents the values of the different evaluation metrics. The results show that DAPSNet achieves the best performance for almost all numbers of visits. As the number of visits increases, the performance of DAPSNet and COGNet improves further, showing that both models effectively use patient historical information. In contrast, the performance of SafeDrug, which uses an RNN to model the patient history, decreases, although its overall performance remains largely unchanged. The performance of MICRON shows a decreasing trend as the number of visits grows, because the drug change measurement mechanism leads to error accumulation and MICRON only predicts the unchanged part. In conclusion, compared with other models, DAPSNet can stably recommend safe and effective drug combinations as the number of visits increases.

Model ablation study
In this section, we verify the effectiveness of each module in DAPSNet. Specifically, we design ablation studies on our dataset and test the following variants:
• DAPSNet w/o $D$: we remove the diagnosis information in each visit.
• DAPSNet w/o $P$: we remove the procedure information in each visit.
• DAPSNet w/o $M$: we remove the medication information in each visit.
• DAPSNet w/o $\alpha$: we remove the code-level attention mechanism in the PR module.
• DAPSNet w/o $\beta$: we remove the visit-level attention mechanism in the PR module.
• DAPSNet w/o $\alpha, \beta$: we remove both the code- and visit-level attention mechanisms.
• DAPSNet w/o $G_{DDI}$: we remove the DDI graph in encoding the drug representation.
• DAPSNet w/o $G_{EHR}$: we remove the EHR graph in encoding the drug representation.
• DAPSNet w/o $G_{Drug}$: we remove the EHR and DDI graphs in encoding the drug representation.
• DAPSNet w/o $H_i^{Pat}$: we remove the set of patient historical similarity representations.
• DAPSNet w/o $H_i$: we remove the set of similarity representations.
• DAPSNet w/o $L_{DDI}$: we remove the DDI loss function.
• DAPSNet w/o $L_{IC}$: we remove the Information Constraint loss function.
Table 4 shows the results for the different variants of DAPSNet. DAPSNet w/o $D$, DAPSNet w/o $P$ and DAPSNet w/o $M$ yield the poorest results among all ablation models, which suggests that the diagnosis, procedure and medication information all play an important role in drug recommendation. DAPSNet w/o $\alpha$, DAPSNet w/o $\beta$ and DAPSNet w/o $\alpha, \beta$ indicate that the code- and visit-level attention mechanisms in the PR learning module improve the recommendation performance. DAPSNet w/o $G_{DDI}$, DAPSNet w/o $G_{EHR}$ and DAPSNet w/o $G_{Drug}$ illustrate that the DDI graph and the EHR graph in the drug graph encoding module not only help learn a more comprehensive drug representation, but also improve the performance of the model. Further, comparing DAPSNet w/o $H_i^{Pat}$ and DAPSNet w/o $H_i$ with our model illustrates that the patient historical similarity and the personal historical similarity improve the accuracy and completeness of the patient retrieval module, thereby improving model performance. The results of DAPSNet w/o $L_{IC}$ indicate that the IB principle constrains the learned PRs and contributes to the final result. We also found that the variant without the DDI loss, DAPSNet w/o $L_{DDI}$, has better results on some indicators (e.g. Jaccard, F1, PR-AUC and ROC-AUC), but its DDI rate is much higher than that of DAPSNet.
Without the constraint of the DDI loss function, DAPSNet w/o $L_{DDI}$ has a DDI rate similar to that of the MIMIC-III dataset itself and the average number of recommended drugs increases, which suggests that our DAPSNet mimics the behavior of physicians and provides better performance in prescribing drug combinations. Overall, compared with all variants, DAPSNet achieves more balanced and accurate results in drug recommendation.

Detailed components study
In order to better illustrate the effectiveness of our model and to investigate the performance of different models for learning PR and measuring disease trajectories similarity. Specifically, we further design the ablation studies with the SOTA methods on our dataset and test on the following variants:    Note: (Ã ð#NÞ ) indicates that this wrongly predicted drug Ã has interactions with N correctly predicted drugs.
• RNN PatRepr þ DAPSNet: we replaced the PR module of DAPSNet with the PR module that adopts RNN. • Transformer PatRepr þ DAPSNet: we replaced the PR module of DAPSNet with the PR module that adopts Transformer block. Table 5 shows the results for the different variants. We conducted two sets of experiments. In the first set of experiments, we aim to explore the impact of DDI graph and EHR graph on the model performance in drug encoding. We construct different models without DDI knowledge and compare their performance. Since COGNet w=o G Drug removes both the EHR graph and the DDI graph, we choose DAPSNet w=o G Drug for a fair comparison. Due to the constraint of the DDI threshold (0.06), SafeDrug w=o Local reaches the lowest DDI rate. From the comparison between the results in the table and the results of the original model, it can be seen that removing the DDI graph will have a negative impact on the DDI rate and performance of the model. Meanwhile, DAPSNet w=o G DDI and DAPSNet w=o G Drug can maintain lower DDI rate and higher prediction accuracy compared to other variants, which shows that the whole framework is indeed effective.
In the second set of experiments, we investigate the differences between the PR learning modules of different models. We plug the PR module of DAPSNet into GAMENet, SafeDrug and COGNet, respectively, and, conversely, plug into our model the RNN-based PR module adopted in GAMENet and SafeDrug and the Transformer-based PR module adopted in COGNet. We then compare these variants with the original models. The results show that, with our PR module, the variants of GAMENet, SafeDrug and COGNet have significantly lower DDI rates and better accuracy than the original models. Furthermore, we replace the PR module of DAPSNet with the PR learning methods of these SOTA models. There are two existing methods: GAMENet and SafeDrug use RNNs to encode diagnosis and treatment information separately, whereas COGNet uses Transformer blocks to encode each type of medical information separately. These two variants of DAPSNet perform worse than DAPSNet itself in both DDI rate and model performance. The results of the second set of experiments demonstrate that the PR learning module in our model is comprehensive and effective.

Case study
We provide a case study to show the effectiveness of DAPSNet. We choose a patient with five visits from the test set and compare the drug combinations predicted by our model with those predicted by GAMENet, SafeDrug and COGNet from the patient's historical medical records. The detailed diagnosis IDs, the recommended drug ATC-third codes (in MIMIC-III) and the DDI rate of the drug combination for each visit of the selected patient are provided in Table 6. In addition, Figure 3 gives a concise and intuitive display. We use ICD-9 codes to represent diagnosis records and ATC-third codes to represent recommended drugs. In Figure 3 and Table 6, 'missed' denotes drugs that are in the ground truth but are not predicted, while 'unseen' denotes drugs that are predicted by a model but are not in the ground truth. First, our model gives the best recommendation results: for each visit, it predicts the largest number of correct drugs with the highest accuracy, and the fewest redundant drugs. Further, we calculated the DDI score of each drug combination. The DDI scores of the drugs predicted by our model and by SafeDrug are lower than those of GAMENet and COGNet, owing to the DDI loss function. While keeping the DDI rate low, our model achieves higher accuracy than SafeDrug; GAMENet and COGNet improve accuracy, but the larger number of drugs they recommend leads to higher DDI rates. We also analyze the DDIs between the unseen and correct drugs predicted by each model. Interestingly, the unseen drugs predicted by DAPSNet have fewer interactions with its correctly predicted drugs than those of the other models, which helps constrain the DDI rate. Combined with the previous experimental results, this case study further verifies that the drug combinations recommended by our model strike a balance between efficacy and safety.
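The per-visit bookkeeping used in this case study can be sketched as follows: splitting a prediction into correct, missed and unseen drugs, and counting how many correctly predicted drugs each unseen drug interacts with. This is an illustrative sketch only. The function name `compare_visit`, the placeholder drug codes and the interaction pairs are hypothetical and do not reproduce the patient or the values in Table 6.

```python
def compare_visit(predicted, ground_truth, ddi_pairs):
    """Compare one visit's predicted drug set with the prescribed set.

    ddi_pairs: set of frozensets, each holding two interacting drug codes.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    correct = predicted & ground_truth   # predicted and prescribed
    missed = ground_truth - predicted    # prescribed but not predicted
    unseen = predicted - ground_truth    # predicted but not prescribed
    union = predicted | ground_truth
    jaccard = len(correct) / len(union) if union else 0.0
    # For each unseen drug, count the correct drugs it interacts with.
    unseen_interactions = {
        u: sum(frozenset((u, c)) in ddi_pairs for c in correct)
        for u in unseen
    }
    return {
        "correct": correct,
        "missed": missed,
        "unseen": unseen,
        "jaccard": jaccard,
        "unseen_interactions": unseen_interactions,
    }


# Hypothetical toy visit (placeholder codes, invented interactions).
toy_pairs = {
    frozenset(("D5", "D1")),
    frozenset(("D5", "D2")),
    frozenset(("D6", "D3")),
}
toy_result = compare_visit(
    predicted=["D1", "D2", "D5", "D6"],
    ground_truth=["D1", "D2", "D3", "D4"],
    ddi_pairs=toy_pairs,
)
```

In this toy visit, the unseen drug `D5` interacts with both correct drugs, whereas `D6` interacts only with a missed drug and therefore counts zero interactions with the correct set.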

Conclusion
In this work, we proposed DAPSNet, a novel drug recommendation model that first integrates historical prescription information into the encoding of the PR and then retrieves similar disease states across patients to jointly enhance drug recommendation performance. Specifically, we learned a comprehensive PR from the patient's historical visit information through a novel PR module that incorporates code- and visit-level attention mechanisms. Furthermore, we retrieved the drug combinations corresponding to similar disease states, drawn from both the patient's own history and the histories of other patients, to obtain additional drug information, which improved the prediction accuracy. We designed an information constraint loss function based on the IB principle to constrain the PR and obtain a more robust model. The experimental results on the MIMIC-III dataset demonstrated that DAPSNet is superior to the SOTA methods in accurately predicting drug combinations. In addition, our model achieves a low DDI rate among the predicted drugs, supporting safe and effective recommendations.